Handling Cascading Failures: The Case for Topology-Aware Fault-Tolerance
Abstract
Large distributed systems contain multiple components that can interact in sometimes unforeseen and complicated ways; this emergent “vulnerability of complexity” increases the likelihood of cascading failures that might result in widespread disruption. Our research explores whether we can exploit the knowledge of the system’s topology, the application’s interconnections and the application’s normal fault-free behavior to build proactive fault-tolerance techniques that could curb the spread of cascading failures and enable faster system-wide recovery. We seek to characterize what the topology knowledge would entail, quantify the benefits of our approach and understand the associated tradeoffs.
Similar resources
Reliability and Performance Evaluation of Fault-aware Routing Methods for Network-on-Chip Architectures (RESEARCH NOTE)
Nowadays, faults and failures are increasing especially in complex systems such as Network-on-Chip (NoC) based Systems-on-a-Chip due to the increasing susceptibility and decreasing feature sizes. On the other hand, fault-tolerant routing algorithms have an evident effect on tolerating permanent faults and improving the reliability of a Network-on-Chip based system. This paper presents reliabili...
The Relaxed-Ring: a Fault-Tolerant Topology for Structured Overlay Networks
Fault-tolerance and lookup consistency are considered crucial properties for building applications on top of structured overlay networks. Many of these networks use the ring topology for the organization of their peers. The network must handle multiple joins, leaves and failures of peers while keeping the connection between every pair of successor-predecessor correct. This property makes the ma...
A Survey of QoS Multicasting Issues
The recent proliferation of QoS-aware group applications over the Internet has accelerated the need for scalable and efficient multicast support. In this article, we present a multicast “life-cycle” model which identifies the various issues that are involved in a typical multicast session. During the life-cycle of a multicast session, three important events can occur: group dynamics, network dy...
Characteristics, Impact, and Tolerance of Partial Disk Failures
Hard-disk failures are one of the primary causes of data loss in both enterprise storage systems and personal computers. Most disk failures are partial failures, where only some sectors are unavailable due to a latent sector error or some blocks are silently corrupted. This dissertation focuses on all aspects of such partial disk failures – their characteristics, their impact on different syste...
Topological Analysis and Mitigation Strategies for Cascading Failures in Power Grid Networks
Recently, there has been a growing concern about the overload status of the power grid networks, and the increasing possibility of cascading failures. Many researchers have studied these networks to provide design guidelines for more robust power grids. Topological analysis is one of the components of system analysis for its robustness. This paper presents a complex systems analysis of power gr...
Publication date: 2005